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Abstract 

The Facioscapulohumeral Muscular Dystrophy (FSHD) Is an autosomal dominant neuromuscular disorder whose incidence is 
estimated in about one in 400,000 to one in 20,000. No effective therapeutic strategies are known to halt progression or 
reverse muscle weakness and atrophy. It is known that the FSHD is caused by modifications located within a D4ZA repeat 
array in the chromosome 4q, while recent advances have linked these modifications to the DUX4 gene. Unfortunately, the 
complete mechanisms responsible for the molecular pathogenesis and progressive muscle weakness still remain unknown. 
Although there are many studies addressing cancer databases from a machine learning perspective, there is no such 
precedent in the analysis of the FSHD. This study aims to fill this gap by analyzing two specific FSHD databases. A feature 
selection algorithm is used as the main engine to select genes promoting the highest possible classification capacity. The 
combination of feature selection and classification aims at obtaining simple models (in terms of very low numbers of genes) 
capable of good generalization, that may be associated with the disease. We show that the reported method is highly 
efficient in finding genes to discern between healthy cases (not affected by the FSHD) and FSHD cases, allowing the 
discovery of very parsimonious models that yield negligible repeated cross-validation error. These models in turn give rise 
to very simple decision procedures in the form of a decision tree. Current biological evidence regarding these genes shows 
that they are linked to skeletal muscle processes concerning specific human conditions. 
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Introduction 

The Facioscapulohumeral Muscular Dystrophy (FSHD) is an 
autosomal dominant neuromuscular disorder and the third most 
common inherited muscular dystrophy [1,2]. Its incidence may 
vary in different places and probably in different racial groups, but 
recent estimates account for one in about 400,000 to one in 20,000 
[3]. FSHD patients show progressive weakening and atrophy of 
the muscles in the face, slowly progressing to the shoulder, upper 
arm muscles and shoulder girdle, down to the stomach and lower 
limbs. Inability to flex the foot upward, foot weakness, and an 
onset of right/left asymmetry are also common symptoms [4,5]. 

Although the FSHD is considered a relatively benign dystrophy, 
about 20% of the patients presenting this disorder are eventually 
restrained to a wheel chair. The age of onset is variable, being the 
second decade of life the most common stage where patients 
become symptomatic. In some cases, however, symptoms never 
develop even when the individual has the mutation associated with 
the FSHD. 

No effective therapeutic strategies are known to either halt 
progression or reverse muscle weakness and atrophy in the FSHD 
[6]. However, there are a number of actions that can provide 
symptomatic and functional improvement in many patients. In 
particular, the use of assistive devices -such as braces, standing 
frames, or walkers— is of great help. Physical therapies like 
exercises in water, complemented by psychological support and 
speech therapy may also alleviate specially difficult life conditions. 



It is known that the FSHD is caused by deletion of a subset 
of D4Z4 macrosatellite repeat units in the subtelomere of 
chromosome 4q [7]. D4Z4 modification needs to occur on a 
specific chromosomic background to cause the FSHD. More 
than 95% of patients with clinical FSHD have an associated 
D4Z4 deletion on the 4q35 chromosome. However, a smaU 
number of kindreds with clinically typical FSHD do not present 
this dynamic. A second FSHD locus has not yet been identified 
[8]. Recent advances involve the DUX4 gene, a retrogene 
sequence within D4Z4 that encodes a double homeodomarn 
protein whose exact function is not entirely known. Although 
the proper mechanisms responsible for the progressive muscle 
weakness still remain unknown, the study of this gene could 
offer a possible therapeutic way [7]. 

It is generally believed that the monitoring of expression levels 
for thousands of genes simultaneously may lead to a more 
complete understanding of the molecular variations among 
different cell conditions. In the literature on machine learning, 
contributions concerning the analysis of gene expression FSHD 
data are very scarce, probably because of unawareness towards the 
highly rare diseases. The situation is aggravated by the absence of 
scientific data outside purely medical domains, in order to attack 
the problem from a different point of view. In contrast, there is 
now a vast body of available datasets about microarray gene 
expression analysis when focused to cancer diseases. Specifically, 
microarray gene expression databases have been used to 
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discriminate between tumours or tumour subtypes, and to study 
biological properties of tumours -see, e.g., [9]. 

Over the last decade, Machine Learning (ML) has made 
significant inroads in the fields of bioinformatics and biomedicine 
[10]. Specifically, cancer research has applied a variety of ML 
algorithms for tumor prediction by associating expression patterns 
with clinical outcomes for patients with tumors [1 1]. The majority 
of this research has focused on building accurate classification 
models from reduced sets of features. Some of these analyses also 
aim to gain understanding of the differences between normal and 
malignant cells and to identify genes that are differentially 
regulated during cancer development. The importance of the 
vahdity and reproducibility of statistical analysis and reporting 
cannot be stressed enough [12]. 

Typically, a gene expression data set may consist of dozens of 
samples but with thousands or even a few tens of thousands of 
genes (acting as features, using the ML terminology). Predictive 
model construction using this very high ratio between number of 
features and number of samples is a delicate undertaking, prone to 
obtain unreliable readings. As a result, dimensionality reduction 
and in particular feature selection techniques may be very useful, as a 
way to reduce the problem complexity and lighten medical expert 
diagnosis. 

Of special importance in a practical medical setting is the 
interpretability of the obtained solutions, something that limits the 
applicability of methods such as PCA or ICA (whose solutions 
involve weighted combinations of genes, instead of individual 
genes). Moreover, in a medical context, data visualization in a low- 
dimensional representation space may become extremely important, 
as it would help doctors to gain insights into this complex and 
highly sensitive domain. The development of predictive models 
able to discern between healthy and FSHD samples with minimal 
error rate and amenable to direct interpretation is thus a clear 
research goal. When predictive models use very low numbers of 
relevant genes, these genes are likely to be associated with the 
disease, and can be used as a starting mechanism for further 
dedicated study from a biological point of view. 

The present study addresses all these issues in two FSHD 
databases (named, just for reference in this paper, as FSHD-DB 1 
and FSHD-DB2) to discern between healthy and FSHD samples 
(clinical cases). We report experimental results supporting the 
practical advantage of combining robust feature selection and 
classification in the analyzed FSHD datasets. The described 
method is able to unveil two groups of genes that yield very low 
mean cross-vahdation error. These genes can be used to build very 
simple decision procedures in the form of a decision tree. 

Results and Discussion 

FSHD-DB1 Database 

The feature selection process in Algorithm 1 comes to a final 
solution in the form of a subset with only three genes and a 100% 
of mean 5 x 5 cv accuracy. This final subset is presented in Table 1 
including its gene IDs and fuU names. It will be hereafter referred 
as the FSHD-DB 1 model. In comparison, PAMR delivers a 96.8% 
of mean 5 x5 cv accuracy with 2 genes (Table 2), and SVM-RFE 
delivers a comparable 99.4% mean 5x,5 cv accuracy, using 5 
genes (Table 3). As a further comparison, if we consider the two 
genes signaled as most relevant in the literature [DUX4 [7] and 
FRGl [1 3]), the corresponding mean 5 x5 cv accuracy of these two 
genes (taken together) is 84.65%. 

Visualization. Data visualization in a low-dimensional rep- 
resentation space is extremely important to gain a better 
understanding of the solution delivered by the process. To 



Table 1. Best gene subset found using the proposed method 
and LDA as performance measure in FSHD-DB1 (the FSHD- 
DB1 model). 





Probe set ID 


Gene 


Name 


201088_ot 


KPNA2 


karyopherin alpha 2 (RAG cohort 
1, importin alpha 1) 


219746_at 


DPF3 


D4, zinc and double PHD fingers, 
family 3 


201552_at 


LAM PI 


lysosomal-associated membrane 
protein 1 
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visualize the result, the data corresponding to the FSHD-DB 1 
model are plotted using the three selected genes as axes, without 
any pre-processing method or projection technique -Fig. 1. In 
addition, the LDA decision boundary fitted in the whole data set is 
shown. The FSHD group presents a less compact distribution, 
while the Healthy group is clustered around a specific region of the 
representation space given by the three genes found. It can be seen 
that the two conditions are neatly separated. 

Figure 2 shows a box plot for each gene in the FSHD-DB 1 
model. LAA'IPl shows a mean value for FSHD samples of 
2087.90+157.04, against Healthy with mean 1092.91 + 54.41; 
DPF3 shows a more even expression level, FSHD with 
1005.78 + 83.79 and Healthy with 881.93 + 40.73; KPNA2 tends 
to up-regulate heavily in FSHD (mean 788.81 +78.92, compared 
to Healthy with 305.92 + 23.59). 

Figure 3 depicts a dendrogram of cases and standardized gene 
expression levels for the FSHD-DB 1 model. Each case is identified 
with an ID number, prefixed by a letter indicating class 
membership, //for Healthy and i^for FSHD. It is apparent that 
LAMPl shows an up-regulation in most of the FSHD cases, as well 
as KPMA2; DPF3 shows a slightly difiTuse expression level. 

Nonetheless, it is noticed in Fig. 3 that the natural clusters do 
not necessarily correspond to labeled samples, and thus supervised 
information is needed to create accurate prediction models, even 
in this low-dimensional representation. Three clusters are discov- 
ered: a first one (HI to HI 7), in which most (but certainly not all) 
of the samples belong to Healthy class; a second group (HI 8 to 
H22), containing three Healthy and two FSHD samples; finally, a 
third group (from F23 on) which is completely messed up. This 
result -although is certainly dependent on the limitations of 
clustering methods- alerts against using unsupervised feature 
extraction methods like PCA. 

An interesting point to be emphasized in these graphic 
representations is that the FSHD-DB 1 model clearly clusters the 
two conditions iieady -Fig. 1. We were therefore interested in 
ascertaining to what extent is this result stable and may thus 
constitute a good departing point for future studies. To this end, 
we performed two further investigations: 



Table 2. Best gene subset found using PAMR in FSHD-DB1. 




Probe set ID Gene 


Name 


218959_at HOXCIO 


homeobox CIO 


215000_s_of FEZ2 


fasciculation and elongation protein 




zeta 2 (zygin II) 
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Table 3. Best gene 


subset found 


using SVM-RFE in FSHD-DB1. 




Probe set ID 


Gene 


Name 


202594_at 


LEPROTLl 


leptin receptor overlapping transcript-lil<e 1 


208065_of 


ST8SIA3 


ST8 aipha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 3 


209797_at 


CNPY2 


canopy 2 homolog {zebrafish} 


215000_s_ot 


FEZ2 


fasciculation and elongation protein zeta 2 (zygin II) 


218959_af 


HOXC10 


homeobox CIO 


doi:l 0.1 371 /journal.pone.0082071 .t003 



1 . The first action was to change the resamphng method to 1 0 
times 10-fold cross validation (10x10 cv). This form of 
resampling entails a much higher computational cost; however, 
it has been suggested as adequate for small sample situations 
[14]. 

2. The second action was to analyze the statistical diflferences 
between FSHD vs. Healthy samples in the expression levels for 
the genes in the model. In addition, we explored the possibility 
that a single gene is able to (almost) perfectly separate the two 
classes by mere chance. 

Statistical analysis. We were interested in exploring the 
effect of changing the resampling method, keeping the same 
classifier (LDA in this case), in order to exclude this source of 
variation from the analysis. Remarkably, using 10x10 cv instead 
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Figure 1. LDA decision surface for tlie FSHD-DB1 model. 
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of 5x5 cv in Algorithm 1, it was found that the final result fuUy 
coincided with the FSHD-DB 1 model. 

In order to assess statistical significance of expression levels, the 
Mann-Whitney U-test (MWU) was used in the comparison 
between FSHD vs. Healthy samples in the model. This is a non- 
parametric hypothesis test for assessing whether one of the two 
conditions (FSHD in this case) tends to have larger values than the 
other. 

For KPNA2, medians for the two groups Healthy and FSHD 
were 307.41 and 736.98; the distributions in the two groups 
differed significantly (Mann- Whitney W = 243, /i-value 
2.06-10-^). 

For DPF3, medians for the two groups Healthy and FSHD were 
888.61 and 867.44; the distributions in the two groups did not 
differed significandy (Mann- Whitney W = 150, p-value 0.1886). 

For LAMPl, medians for the two groups Healthy and FSHD 
were 1088.76 and 2000.11; the distributions in the two groups 
differed significantly (Mann- Whitney W = 243, /?- value 
2.06-10-^). 

Therefore both KPMA2 and LAMPl genes present high 
differences in the two conditions. Although these two genes are 
not equal, they present notable similarities. Spearman's rank 
correlation coefficient is equal to 0.771. This fact will be used to 
simplify the FSHD-DB 1 model. 

One may still wonder about the probability of finding such a 
single gene like KPMA2-t)ial separates the two conditions with one 
exception- by mere chance (Fig. 2). If a gene bears no relation 
with the disease, we could expect an arbitrary pattern for the 
distribution of the two conditions (healthy vs. FSHD cases) across 
the expressed values of the gene. The probability that one or more 
genes in 22,283 separates the two conditions (14 FSHD; 18 
healthy) with only one exception is found to be around 7.9-10^''. 

A final interpretable model. Even though the LDA 
decision boundary in Fig. 1 depicts a clean separation between 
the two patient conditions, its application as a decision tool may 
not be straightforward. In this sense, decision trees are one of the 
preferred tools by experts in decision making processes. Moreover, 
the final selection of a gene subset may stiU provide few clues about 
the structure of the two conditions with respect to their expression 
levels. Some accuracy may be sacrificed for increased interpret- 
abUity of the model. 

Figure 4 shows a CART decision tree [15] built with the FSHD- 
DB 1 model. The main question is on the expression level of gene 
KPNA2: the right branch corresponds to 1 3 (all but one) of the 
FSHD patients; the left branch corresponds to all of the 18 healthy 
ones plus the remaining FSHD patient. Moreover, one may 
wonder if there is a second gene, expressed such that it separates 
this specific patient from the 18 healthy ones, and indeed there is 
one: precisely DPF3. Whether this last patient is an oudier in a 
medical sense we cannot know, but it deserves further clinical 
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Figure 2. Box plots for the expression levels of the genes in the FSHD-DB1 model. 

doi:1 0.1 371 /journal.pone.0082071 .g002 



investigation. Therefore, despite LAAIPl .shows a markedly 
differential expression, it may be excluded from the decision flow. 

Biological evidence. In this section, we compile scientific 
knowledge about the two genes in the final subset, including their 
known primary functions in cellular process. 

KPNA2. RmA2 is Karyopherin alpha 2 (RAG cohort 1, 
importin alpha 1). It is known that muscle functions are dependent 
on spatial and temporal control of gene expressions in myofibers. 
These are multinucleated cells that contain hundreds of nuclei 
spread across the length of the cell in a common cytoplasm. Their 
very important role is to control the transcriptional activity of 
several nuclei in a common cytoplasm [16]. 

Analyzing the role of karyopherin alpha (KPNA) and paralogs- 
specific roles oi KPNAl and KPMA2 during myogenesis, it has been 
found that these two genes do regulate myoblast proliferation. 
Particularly, KPNA2 regulates myotube size and myocyte migra- 
tion [17]. Therefore, both may be involved in the nuclear 
transport of proteins [18], which has a key role in controlling gene 
expression in skeletal muscles. 



DPF3. DPF3 is D4, zinc and double PHD fingers, family 3. 
This gene belongs to the neuron-specific chromatin remodeling 
complex (nBAF complex), acting as a tissue-specific anchor 
between histone acetylations and methylations and chromatin 
remodeling [18,19]. Experiments in human cardiac samples and 
mouse embryonic and adult hearts showed that it plays a role in 
heart and skeletal muscle development [20]. It also presents an up- 
regulated expression in patients with Tetralogy of Fallot, a congenital 
heart defect, partially characterized by muscular hypertrophy. 

FSHD-DB2 Database 

The feature selection process in Algorithm 1 comes to a final 
solution with six genes and 99.6% of mean 5 x5 cv accuracy. This 
final subset is presented in Table 4 including its gene IDs and fuU 
names (of which two of them are yet unknown). It will be hereafter 
referred as the FSHD-DB2 model. In comparison, PAMR delivers 
a 70.4% of mean 5x5 cv accuracy with 3 genes (Table 5), and 
SVM-RFE delivers 85.2% mean 5x5 cv accuracy, using 5 genes 
(Table 6, of which three of them are unknown). This database 
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Figure 3. Clustering of the expression levels of the genes in the FSHD-DB1 model. Left: by genes; Top: by samples. 
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contains DUX4 entries, corresponding to 4 isoforms. If we consider 
the most informative model, including the 4 sequences of DUX4 
and FRGl together, the corresponding 5x5 cv accuracy is found 
to be a disappointing 39.60%. 

Visualization. Figure 5 shows a box plot for each gene in 
the FSHD-DB2 model. The first three genes in the model 



[Unknown-7905039, GDNF and EXTLl) tend to up-regulate 
heavily, this time in Healthy samples. The other three seem to 
contain complementary information in the variance rather than in 
the central tendency. Figure 6 depicts a dendrogram of cases and 
standardized gene expression levels for the FSHD-DB2 model. 
Each case is identified with an ID number, prefixed by a letter 
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Figure 4. Classification tree for tlie simplified model in the 
FSHD-DB1 database. The boxes are leaves indicating the prediction, 
the numbers of cases for each condition, and the overall percentage of 
covered cases. 

doi:1 0.1 371/journal.pone.0082071 .g004 

indicating class membership, //for Healthy and Ftor FSHD. It is 
apparent that the natural clusters are less homogeneous than those 
obtained for the FSHD-DBl database. Nonetheless, the group of 
central clusters (formed only by Healthy cases, H4 to H36) is 
clearly identified by GDMF and EXTLl, both genes showing a 
definite up-regulation in all cases. 

Statistical analysis. Again, statistical significance of individ- 
ual expression levels in the FSHD-DB2 model is assessed with a 
Mann-Whitney U-test (MWU) in the comparison between FSHD 
vs. Healthy samples. 

For Unknown- 79 05 03 9, medians for the two groups (FSHD and 
Healthy) were 2.41 and 2.62, resp.; the distributions in the two 
groups differed significantiy (MannWhitney W=118, /i-value 
9.22-10-5). 

For GDMF, medians for the two groups (FSHD and Healthy) 
were 6.76 and 7.22, resp.; the distributions in the two groups 
differed significandy (MannWhitney W = 114, /)-value 6.32-10-'). 

For EXTLl, medians for the two groups (FSHD and Healthy) 
were 6.98 and 7.26, resp.; the distributions in the two groups 
differed significandy (MannWhitney W = 96, p-value 9.90- 10^*). 

For the other three genes, the medians for the two groups are 
very close and the test is non-significant at the 95% level. This 
seems to confirm the previous interpretation of a first subgroup of 
three genes {Unknown-7905039, GDMF and EXTLl) that contain 
highly discriminant information in their means (or medians) and a 
second subgroup of another three genes {RPL36AP40, LGHMBP2 
and Unknown-8 147750) that complement the first group. Interest- 
ingly, this split fully coincides with the order in which the genes 
were discovered by the feature selection process in Algorithm 1 . 
The second-ranked gene, GDMF, is also chosen by the PAMR 
method (Table 5). 



Table 4. Best gene subset found using the proposed method 
and LDA as performance measure in FSHD-DB2 (the FSHD- 
DB2 model). 







Probe set ID 


Gene 


Name 


7905039 


Unknown 




8111670 


GDNF 


glial cell derived neurotrophic factor 


7899075 


EXTLl 


exostoses (multiple)-like 1 


7947152 


RPL36AP40 


ribosomal protein L36a pseudogene 40 


7942073 


IGHMBP2 


immunoglobulin mu binding protein 2 


8147750 


Unknown 





doi:l 0.1 371/journal.pone.0082071 .t004 



In contrast to the previous database, the genes in the FSHD- 
DB2 model seem quite different and, this time, no single gene can 
separate the two conditions neatiy; rather, they collaborate to 
reach a very high classification accuracy. Indeeed, the absolute 
value of Spearman's rank correlation coefficient is lower than 0.5 
in all cases, and specially low in the first subgroup of relevant 
genes. 

A final interpretable model. As for the previous database, 
accuracy may be sacrificed for increased interpretabUity of the 
model. Figure 7 shows a CART decision tree built with the FSHD- 
DB2 model. The interpretation of the tree is as follows: patients 
showing a value of GDMF lower than 6.8 are all classified 
(correctly) as having the FSHD condition, and this group 
constitutes 28% of the total; patients showing a value of GDMF 
greater than 6.8 and a value of EXTLl greater than 7.2 are all 
classified (correctiy) as not having the FSHD condition, and this 
group constitutes 34% of the total; for the final group (38% of the 
total), 12 patients are correctly identified as having the FSHD 
condition, and the remaining 7 are incorrectly identified as having 
the FSHD condition; thus the tree makes 7 false positives and no 
false negatives. 

Biological evidence. In this section, we compile scientific 
knowledge about the two genes in the final subset, including their 
known primary functions in cellular process. 

GDNF. GDMF is glial cell derived neurotrophic factor, a gene 
encoding a highly conserved neurotrophic factor. The recombi- 
nant form of the protein has been shown to promote the survival 
and differentiation of dopaminergic neurons in culture, and is able 
to prevent apoptosis of motor neurons induced by axotomy [18]. 
GDNF is also associated to the Hirschsprung disease (HSCR), a 
congenital disorder typically characterised by a part or all of the 
large intestine having no nerves and intestinal obstruction, due to 
an absence of intramural ganglia along the intestine [21]. 

EXTLl. EXTLl is exostoses (multiple) -lih: 1. This gene is a 
member of the multiple exostoses (EXT) family of glycosyltrans- 
ferases. The encoded protein is involved in chain elongation of 
some acidic complex polysaccharides found on the cell surface and 



Table 5. Best gene subset found using PAMR in FSHD-DB2. 







Probe set ID 


Gene 


Name 


8111892 


OXCTl 


3-oxoacid CoA transferase 1 


8062461 


LBP 


lipopolysaccharide binding protein 


8111670 


GDNF 


glial cell derived neurotrophic factor 



doi:1 0.1 371 /journal.pone.0082071 .t005 
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Table 6. Best gene subset found using SVM-RFE in FSHD-DB2. 



Probe set ID 


Gene 


Name 


7893282 


Unknown 




8129666 


SLC2A12 


solute carrier family 2 (facilitated glucose transporter}, member 12 


7926818 


Unknown 




8094938 


NIPALl 


NIPA-IIke domain containing 1 


7938667 


Unknown 


glial cell derived neurotrophic factor 



doi:1 0.1 371 /journal.pone.0082071 .t006 




Figure 5. Box plots for the expression levels of the genes in the FSHD-DB2 model. 

doi:1 0.1 371/journal.pone.0082071 .g005 
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Figure 6. Clustering of the expression levels of the genes in the FSHD-DB2 model. Left: by genes; Top: by samples. 
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in the extracellular matrix [18]. Mutation in EXTl is associated 
with hereditary multiple exostoses, a human disorder character- 
ized by the formation of cartilage-capped bony outgrowths at the 
epiphyseal growth plates [22]. 

Concluding Remarks 

The Facioscapulohumeral Muscular Dystrophy, or FSHD, is a 
highly rare muscle disease for which there is no known cure 
nowadays. Two databases presenting samples of both healthy and 
FSHD patients have been analyzed with machine learning (ML) 
methods. There is hardly any precedent in the literature 
addressing this disease with these techniques. 

The fact that the FSHD data analyzed in this study are scarce 
and of high dimensionality makes their computer-based automat- 
ed classification a difficult undertaking. Most importantly, this 
high dimensionality precludes a straightforward interpretation of 
the obtained results, limiting their usability in a practical medical 
setting. In this vein, computational solutions like the one reported 
here should reckon the need of reporting not only highly accurate 
models: they should also represent low complexity and interpret- 
able solutions amenable to further analysis by experts. 

We have devised an approach to prediction of the FSHD 
condition from gene expression profiling, comprising an effective 
algorithm for gene selection enhanced with a mechanism for tie- 
breaking and based on a fairly standard classifier. To demonstrate 
its effectiveness, we show that the method was highly efficient in 
identifying two subsets of genes that best characterize each class. 
In both cases, the discrimination process is shown very 
conveniently as a two-question decision tree. We have also 
provided evidence for the statistical significance and stability of the 
result. Our method delivers highly interpretable solutions that are 




Figure 7. Classification tree for the simplified model in the 
FSHD-DB2 database. The boxes are leaves indicating the prediction, 
the numbers of cases for each condition, and the overall percentage of 
covered cases. 

doi:1 0.1 371/journal.pone.0082071 .g007 



more accurate than competing methods. The technique is general 
and could be used in other similar scenarios. 

However, in small sample scenarios, there is a high risk of 
overfitting the data: small samples will appropriately support only 
simple models with few parameters (acting as the coefficients of the 
features). Moreover, the use of a classifier having one or more 
hyper-parameters (these are parameters that the classifier cannot 
determine in its training process, and must be determined 
externally) is unaffordable, since this would require an additional 
resampling loop, for which there would almost be no data left. As a 
consequence, the determination of these parameters would be 
subject to a very high degree of uncertainty. We have selected 
Linear Discriminant Analysis (LDA) as the target classifier, using 
equal-covariance Gaussians to approximate class conditional 
probability densities. This choice corresponds to a linear, stable 
and parameter-free classifier. The LDA recognition rate was 
resampled using 5 times 5-fold cross validation. 

One should bear in mind that the excellent reported results do 
not -by themselves- entail a medical solution to the disease, a 
situation that is faced by aU statistical and ML solutions. On the 
contrary, a main goal of exploratory studies of this kind should be 
aimed towards understanding how the variables selected by the 
model fit in relation to prior knowledge from the medical domain. 

Materials and Methods 

The FSHD Databases 

The first database used in this contribution was obtained from 
the EMBL-EBI repository of the European Bioinformatics 
Institute [23]. Specifically, the Experiment E-GEOD-3307 uses the 
Affymetrix GeneChip Human Genome HG-133A and HG- 
U133B designs to analyse a range of muscle diseases for gene 
expression comparative profiling purposes. A total of 1 2 1 muscle 
samples of 1 1 muscle pathologies (plus several healthy samples) 
integrate the data: acute quadriplegic myopathy, juvenile derma- 
tomyositis, amyotophic lateral sclerosis, spastic paraplegia, fascios- 
capulohumeral muscular dystrophy, Emery Dreifuss muscular 
dystrophy, Becker muscular dystrophy, Duchenne muscular 
dystrophy, calpain 3, dysferlin, and the FKRP using U133A and 
U133B array design. These are diseases with a extremely low 
incidence rate in the general population. The Facioscapulohu- 
meral Muscular Dystrophy (FSHD), the targeted group in this 
work, consists of 14 FSHD samples and 18 healthy samples 
described by 22,283 genes or features (HG-133A version). 

The second database was obtained from the GEO (Gene 
Expression Omnibus) repository, a publicly available site in the 
National Center for Biotechnology Information (NCBI). The 
Experiment GSE36398, "Transcriptional profiling in facioscapulo- 
humeral muscular dystrophy to identify candidate biomarkers" is a 
very recent database containing FSHD information only. Using 
the Affymetrix Human Gene 1.0 ST Array, the experiment 
analyses RNA extracted from both biceps and deltoids of FSHD 
subjects (26 samples) and unaffected first-degree relatives (24 
samples), rendering a dataset that consists of 50 samples, described 
by 33,297 genes or features [24]. 

There are no missing data in any of the two datasets; and both 
contain a mixture of positive and negative examples, necessary for 
learning. Moreover, in both cases the whole datasets were used. 

Linear Discriminant Analysis 

Linear and quadratic discriminant analyses or LDA/QDA 
(Duda et al. 2001) are widely used parametric methods which 
assume that the class distributions are multivariate Gaussians. 
With LDA, all classes are assumed to have the same covariance 
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matrix. QDA does not need such an assumption; however, the 
number of parameters to be estimated from the data available for 
each class is much higher, entailing lower statistical significance. 

In both methods, classification is achieved by assigning an 
example to the class for which the posterior probability P(oJk\ x ) 
is greater, or equivalently for which ln{P{a)k)p{ x \(Ok)) is 
greater. 

These methods are attractive because they need no parameter 

tuning, and their limited complexity (quadratic at most) may be a 
solid guard against overfitting the data. Moreover, for LDA fast 
updating procedures exist for the computation of certain forms of 
the cross-vahdation error [25] . The discriminant Junction for class cot 
is expressed as: 

gk(x)= In {P{wk)p(x\(Ok)} 



in P{wk)- In {{lnf\Lk\h " " i^k)%\x- Hk) 



which simplifies to: 



gk(x)= In P{mk)-^ (In +(x - Hk)'^k\^ - l^k)) 



If we assume that all class-conditional distributions p(x\u)k) have 
the same covariance matrix 2, we get: 



gk{x)=\nP{a)k)+ Hi's. '/ifc 



given a set of examples labeled into one of two classes, the linear 
SVM finds their optimal linear separation: this is the hyperplane 
that maximizes the minimum orthogonal distance to a point of 
either class (this distance is called margin of the separation). 

Consider again an i.i.d data sample S= {x\, . . . ,Xjv} of training 
patterns (in n dimensions), labelled into two classes coi,co2 by 
z\, . . . ,zn, with Zj = + 1 if X iSWi and z,- = — 1 if x i6(»2. If we 
set up an affine function g{x) = (^w,xy+b, then we have a linear 
discriminant as sgn(g{ x )), for which we would like: 



<w,x,>-|-&>0 



iw,Xi} + b<0 



X iECOi (Zi = + 1) 



X iSW2 (Zj = — 1) 



In short, z,«M',x,>-|-6)>0, or z,-g(x,)>0, for all 1</<A^. 
Given the hyperplane n : g{x) = 0, the perpendicular distance 



from x to n is d(x,n) = 



\gix)\ 



The support vectors are those x 



closest to the hyperplane. Rescaling w,h su[:h that |<(vv,x) + 6| = 1 
for these closest points, one obtains I^WjX) > 1. The support 
vectors are now those {x,/|<H',x,-) + 6| = 1}. 

The margin m(n) of a plane n can now be written as twice its 

2 

distance of any support vector: m{7l) = 2d(xsy,Ti)= t. — r-, where 

ll^ll 

|^(xsv)| = 1- To maximize the margin, we should minimize || w \\ 
subject to z,(<(w,x,>-|-Z))> 1, for all l<i<N. 

In the case where an hyperplane does not exist that can separate 
correctly the points in the data sample, a set of non-negative slack 
\ ariables are introduced to allow for small marffn violations, leading 
to a soft margin: 



These are linear discriminant functions (linear in x) and 

the decision boundaries gi(x) = gj{x) are hyperplanes in n- 
dimensional space. 

In practical situations, only an i.i.d data sample S is available. 
When means, covariances and priors for every class are not 
available, maximum-likelLhood estimates on S can be used, 
although in this case the Bayesian optimality properties are no 
longer valid. Let Sk^ S be the subset of samples known to belong 
to class (Ok- Then 5i, . . . ,5^ is a partition of 5. Unbiased estimates 
for the vector means and for the class priors can be obtained as: 



Zi{<w,Xiy + b) + ii>l i=l,...,N 



(1) 



where ^, >0. For an error to occur, the corresponding ^, must 
exceed unity, and so c^, is an upper bound on the number of 
training errors. The optimal separating hyperplane can be found as 
the solution of the 1-norm Quadratic Programming problem: 



minjil w\\' + Cj2ii 



H k~ H k = 



1 

JSk\ 



P{oJk)~P{mk) = 



\S\ 



The following pooled covariance matrix is then used: 
1 



where 



'pooled - 



1 



Y,(\Sk\-\)tk 



^fc = I o i' , X] (x-fik)ix-fik)' 

Sk 



s.t. Zi{{w,Xiy+b)>\ - S,i,i= l,...,N 

The solution to this optimization problem corresponds to the 
saddle point of its associated Lagrangian: 



N N 



^a,(z,«M',x,> + i.)-l + i,) + C^i,- ^ft^, 



;=1 



where ai,/^, >0 for /=!,... ,N. 

Once this QP problem is solved, the solution vector w* can be 
expressed as a hnear expansion over the support vectors: 



Linear Support Vector Machines 

The support vector machine (SVM) is a machine learning 
method solidly based on statistical learning theory [26] . Intuitively, 



(2) 



The support vectors are precisely those x ,e5 for which a* > 0. 
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Resampling Methods 

Model selection is concerned with the process of finding the 
optimal model for a set of samples among a set of candidate 
models. Resampling methods aim at making a better use of the 
available data. These methods are very useful for assessing how a 
predictive model that can be the result of a complex modeling 
process will perform in practice. 

The generic goal of cross-validation (CV) is to estimate the 
expected error of a model in a data set that is independent of the 
data that were used to train the model. One round of k-fold CV 
(or k-GV) involves partitioning the sample into k complementary 
subsets, systematically performing the modeling on the union of 
k—l such subsets and checking the obtained model on the 
remaining subset (acting as a validation set). The result of k-CV is 
an estimation of the error if only a fraction (k~l)/k of the 
available data is used. This error is expected to be conservative 
(larger than the error obtained if the entire sample was used). To 
reduce variability, multiple rounds can be performed using 
different partitions, and the results averaged over the rounds. 

The Feature Selection Algorithm 

Feature selection can be seen as a search problem, where each 
state in the search space corresponds to a subset of the features. In 
the ML literature, a wide family of suboptimal algorithms depart 
from an initial solution and iteratively add or delete features by 
locally optimizing the error function. In forward selection, features 
are progressively incorporated into larger subsets; in backward 
sekction (or elimination) one starts with the full set of features and 
progressively eliminates elements from it. 

Wrappers are often criticized because they are computationally 
very expensive. Moreover, feature selection is badly affected by 
small sample sizes, producing overly optimistic results and 
introducing an excess of variance in the readings. This is 
aggravated in the presence of very sophisticated search algorithms 
[27]. On the other hand, greedy search strategies seem to be 
particularly computationally advantageous and may alleviate the 
problem of overfitting [28]. Nevertheless, traditional pure for- 
wardd selection and backward elimination search algorithms are 
ill-advised in that they cannot rectify their decisions and may end 
up delivering poor solutions both in terms of quality and size. 

To reduce the number of genes and obtain small subsets of 
highly relevant genes, we use a simple but effective forward- 
backward feature selection algorithm. This algorithm follows the 
wrapper idea, i.e., the feature selection algorithm uses a learner as 
a subroutine in tlu; search for good subsets [29]. In this general 
setting, when features are added or removed from the current 
subset the algorithm resorts to some performance measure - 
commonly the resampled rate of recognition. 

An interleaved forward-backward search is developed looking 
for the improvement in performance of the chosen performance 
measure. The algorithm is described as the listing Algorithm 1. 
Given a performance measure C to be maximized (in this case, the 
resampled evaluation of a classifier in a data sample), the 
algorithm searches the space of subsets by adding/removing 
features in a hill-climbing fashion. 

Specifically, in every iteration of the outer loop, one feature is 
added to the current b(;st solution BEST, as long as this step 
improves on current performance C^"' . Then a variable number of 
feature removal steps is carried out, inasmuch the same condition 
of improved performance is met. This scheme is oriented to favour 
solutions with low numbers of features. The outer iteration also 
ends when no further improvement is observed. This strategy 
bears some resemblances with a floating search algorithm in its 
forward version [30]. However, it has a far lower computational 



cost given that discarded features are not considered again for 
another inclusion round. Note also that current subset perfor- 
mance is not compared specifically against the best performance 
achieved for the same size of the current subset (as floating methods 
do). It should be mentioned that the algorithm itself needs no 
parameter specification, although the chosen performance mea- 
sure could have. 

Algorithm 1 Forward-Backward gene feature selection. 

1: Input: S = {s\, . . . ,s„}: FuU feature set; 

C: Class feature (Healthy, FSHD) 

£ : 2'^->IR: performance measure, to be maximized 

2:BEST^arg'^^'^Ci{st}) 

3: C""^C{{BEST}) 
4: S^S\{BEST} 
5: repeat 

6: ***Forward Stage*** 

7: arg "^^l C(BEST[J{si}) 

8: C"''"'^CiBEST{J{^''}) 

9: it >£""- then. 

10: BEST^BEST\J{^<^] 

11: Cf^^^C"^ 

12: 

13: end if 

14: ***Backward Stage*** 
15: repeat 

16: y"-- arg ^.JJJ^^ L{BEST\{sr\) 

17: C"''^li{BEST\{^^y) 
18: if r"^'" >£'"'• then 
19: BEST^BEST\{^"} 
20: C,"''^C'^ 
21: end if 

22: until BEST does not change 
23: until BEST does not change 
24: Output: BEST: Optimized feature subset 

As explained in the introduction, we ar<; interc'stcd in a solution 
that combines high predictive performance, very small size {i.e., a 
very low number of useful genes), admits visualization and 
interpretation, and hopefully may bear biological relevance. 

To this end, we explicit now how the methods previously 
described glue together. The performance measure C to be 
maximized in Algorithm 1 is the accuracy rate of LDA. This 
recognition rate is resampled using 5 times 5-fold cross validation 
(5x5 cv for short), following common practices in the literature 
[31]. 

Due to the low number of samples, ties among the performance 
measure can happen easily. As a consequence, the gene subset 
selection process will end up in different final solutions, something 
that is not desirable in general [32]. How these ties are broken is 
non-trivial and should be addressed specifically and exphcitly. 
However, the literature does not seem to offer any formal solution 
or procedure. Univariate methods as entropy-based measures 
[33,34], the Fisher Score [35], or some other statistical measures 
could be those preferred for their simplicity -see e.g. [36,37]. 
Instead, a multivariate feature ranking method seems much more 
adequate to measure the relevance of a group of tied features. 

As explained above, linear support vector machines (SVMs) can 
be seen as linear discriminant classifiers. Indeed, the numbers 
{yv*j f' in eq. (2) have been used as a surrogate for the relevance of 
the i-th gene since the pioneering work of [38]. Notice that our 
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approach is different in that predictive performance is the main 
criterion for optimization. Only in case of ties is the magnitude of 
the SVM weight vector being used. This is because the relation 
between this magnitude and final performance is rather indirect. 
This margin-based tie-breaking procedure has been incorporated 
into the feature selection algorithm. It is used every time an 
evaluation of the performance measure may incur on one or more 
ties -lines 2,7 and 16 in Algorithm 1. 

Other Methods 

Prediction analysis for microarrays. PAMR (Prediction 
Analysis for Microarrays) performs sample classification from gene 
expression data, via the nearest shrunken centroid method [39]. 
Similarly to the proposed method, PAMR estimates prediction 
error via cross-validation and provides a list of significant genes 
whose expression characterizes each diagnostic class. 

Support vector machine for recursive feature 
elimination, SVM-RFE (Support Vector Machine - Recursive 
Feature Elimination) [38] has been used widely with great success 
in microarray data analysis, particularly for disease gene finding. It 
largely eliminates redundant genes and usually yields very 
compact gciic subsets. The genes are eliminated according to a 
ranking rclatc'd to weight magnitude in the S\'M solution. This is 
the same criterion for tie-breaking described in the previous 
section. 

Software Implementation 

Algorithm 1 was implemented entirely in IV'IATLAB language, 
version 2012a. The computer codes were run on an Ubuntu Linux 
server version 11.10 with an Intel(R) Xeon(R) CPU E,5620 @ 
2.40 GHz and 8 cores. The deployed solution to Algorithm 1 
takes advantage of the possibility to parallelize parts of the code, 
particularly lines 2, 7 and 16. In an 8-core scenario, eight genes or 
features can be evaluated at the same time. The complete software 
and instructions to reproduce the experiments described in this 
paper (or to conduct new ones) is available at http://nova.mxl. 
uabc.mx/femando/PO/for the interested reader. 
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